This tutorial will demonstrate how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to demonstrated possible workflows. The topics covered are:
emuR and EMU-SDMS, the EMU Speech Database Management Systemserve()ggplot2The generated HTML page of this tutorial is available here.
This tutorial is organised as an R Markdown notebook. To execute the code within this notebook (filename Lecture.Rmd) it has to be opend in RStudio. When executing code, the results appear beneath the code. In order to do so, you need to have R and RStudio installed. When this R notebook is loaded into RStudio, you can excecute chunks by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter, allowing you to experiment with the code. This handout contains all output of the code (tables, visualisations etc.). The easiest way to work with this R code is to clone the entire project from this GitHub repository.
The following libraries will be needed and have to be installed.
library(tidyverse)
library(emuR)
library(tools)
library(rio)
library(ggplot2)
library(magrittr)
library(htmltools)
library(joeyr)
library(knitr)
Before starting to work with the database, let’s have a quick overview how the automatic segmentation of audio files into smaller chunks like words and phonetic segments actually is working.
We are using the MAUS tools from the University of Munich. Next to a web interface, MAUS segmentation can be called directly from within R through an API.
The input is a stream of audio recording alongside of a written representation of the words spoken in this recording. A trained model of the language, here: Luxembourgish, then is used to map the sounds and words derived from the text to the speech signal. The output is an object with temporal alignments between words and sounds. This is a weak form of speech recognition.
Text for a test of forced alignment: Den ale Mann ass an dat kaalt Waasser gefall. (The old man fell into th cold water.) with the corresponding sound file save together in the same directory.
# create an emuDB from this text + audio
convert_txtCollection(dbName = "testDb",
sourceDir = "./testDB",
targetDir = ".")
# load this db
db <- load_emuDB("testDb_emuDB")
INFO: Checking if cache needs update for 1 sessions and 1 bundles ...
INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
INFO: Nothing to update!
# run forced aligment using BAS MAUS tools in R
runBASwebservice_all(db,
transcriptionAttributeDefinitionName = "transcription",
language ="ltz-LU",
verbose = TRUE,
runMINNI = FALSE,
resume = TRUE,
patience = 3)
Show a summary of the database.
# display the overview of the structure and content
summary(db)
Name: testDb
UUID: f6157ad6-119f-4242-9e05-41f376e31e8a
Directory: /Users/peter.gilles/Documents/Data_Science_Humanities/testDb_emuDB
Session count: 1
Bundle count: 1
Annotation item count: 53
Label count: 72
Link count: 50
Database configuration:
SSFF track definitions:
NULL
Level definitions:
name type nrOfAttrDefs attrDefNames
1 bundle ITEM 2 bundle; transcription;
2 ORT ITEM 3 ORT; KAN; KAS;
3 MAU SEGMENT 1 MAU;
4 MAS ITEM 1 MAS;
Link definitions:
type superlevelName sublevelName
1 ONE_TO_MANY bundle ORT
2 ONE_TO_MANY ORT MAS
3 ONE_TO_MANY MAS MAU
The result of the query can also be displayed in the EMU Speech Database Management System.
# open the database in a webviewer
serve(db)
The database contains the audio recordings from the Lëtzebuerger Online Dictionnaire (available here, spoken by one female speaker. The audio files have been automatically segmented with the MAUS tools. We thus have a database conisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).
This database has been created beforehand. Infos how to create such a database is explained in the EMU-SDMS manual.
While we will be working in this tutorial with the full database (26.000 recordings), a small demo database lod_emuDB has been created with 500 recordings which can be used for individual testing.
We start with loading the database and give an overview of structure and content.
# load database
db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
INFO: Checking if cache needs update for 1 sessions and 26093 bundles ...
INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
INFO: Nothing to update!
# for testing purposes, use this smaller database in the Github folder
#db = load_emuDB("lod_emuDB")
# display the overview of the structure and content
summary(db)
Name: lod
UUID: c87793c0-012a-11e9-874b-68b599b5deb4
Directory: /Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB
Session count: 1
Bundle count: 26093
Annotation item count: 515638
Label count: 639542
Link count: 489545
Database configuration:
SSFF track definitions:
name columnName fileExtension
1 dft dft dft
2 praatFms fm praatFms
3 dft2 dft dft
Level definitions:
name type nrOfAttrDefs attrDefNames
1 bundle ITEM 4 bundle; source; SAM; MAO;
2 ORT ITEM 2 ORT; KAN;
3 MAU SEGMENT 1 MAU;
Link definitions:
type superlevelName sublevelName
1 ONE_TO_MANY bundle ORT
2 ONE_TO_MANY ORT MAU
Tracks in emuR are acoustic representation of the speech signal, here dft for the waveform (time-amplitude representation) and praatFms for the formant measures of vowels (see below).
Levels in an emuR database stand for level of interlinked linguistic information. bundle is the entire audio file, ORT stands for the orthographical representation of the audio file segmented in its single words. MAU is the segmentation of all phonetic segment (=sounds) of all ORT segments in bundle.
The hierarchical structure of these levels is expressed in the link definitions as ONE-TO-MANY.
An emuR database can be queried with a powerful query engine. The first example is a simple query for one word, Aarbecht.
sl = query(db, query = "[ ORT == 'Aarbecht']")
sl
The result is a segment list (sl), containing various information about the found item (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System.
serve(db, seglist = sl)
The GUI will open in the Viewer pane of RStudio or you can open it in a browser (Chrome preferred).
Here we can also display the hierarchical structure for this database item, which is accessed during queries. bundle is the top-level, representing the entire audio file.
The ORT level contains the nodes for the individual words in the bundle, here the two words Aarbecht and Aarbechten. The dependent level then is MAU (=Munich Automatic Unit) representing the single sounds of the words in ORT.
Two aspects render emuR query system extremely powerful: the use of regular expressions (including negation and other extensions) and the combined query on different levels of the database.
Let’s try more complex queries:
Regular expression, operator =~, words beginning with Aarbecht...
sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
Warning in query_labels(emuDBhandle, levelName = lvlName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
first character in a sequence i.e. 'a.*' now also matches
'weakness' as it contains the sequence. '^a.*'
matches sequences that start with 'a.*'
e.g. the word 'amongst'.
sl
Query: Select the vowel [aː] in all words beginning with Aarbecht… Note that in the segment list the label now has changed to the vowel and the respective start-end information is now only for this sound [aː].
sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
Warning in query_labels(emuDBhandle, levelName = lvlName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
first character in a sequence i.e. 'a.*' now also matches
'weakness' as it contains the sequence. '^a.*'
matches sequences that start with 'a.*'
e.g. the word 'amongst'.
sl
Query a sequence of sounds, e.g. e followed by k (Méck), by using the sequence operator ->.
sl = query(db, query = "[ MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
In the sequence e->k, query only the vowel. Use the result modifier #.
sl = query(db, query = "[ #MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
Wuery all sound items that occur at the end of a word and are [p].
sl <- query(db, query = "[End(ORT, MAU) == TRUE & MAU == p ]")
sl
Retrieve all word that contain five segments.
sl <- query(db, "[Num(ORT, MAU) == 5]")
sl
Using requery, the results from a previous query can be further specified.
# requery_seq()
# query all "m" phonetic items
sl_m = query(db, "MAU == m")
# sequential requery (left shift result by 1 (== offset of -1))
# and hence retrieve all phonetic items directly preceeding
# all "m" phonetic items
sl_req_n <- requery_seq(db,
seglist = sl_m,
offset = -1)
sl_req_n
Groups of sounds can be grouped together to label groups: - all long monophthongs: "iː", "uː", "aː", "oː", "ɔː", "ɛː", "eː" - all short monophthongs: "i", "u", "ɑ", "o", "æ", "e", "ə", "ɐ"
# query all long monophthongs
sl = query(db, "MAU == longMonophthongs")
sl
With the query the user can compile a data frame from the database which then forms the subset for further phonetic analysis. We can select e.g. all instances of certain (or all) vowels, specifying the context before or after etc. etc.
Of course, querying for individual segments in the audio file like words or sounds is possible only, if this information has been added to the database before.
The first task is to extract the time-amplitude waveform representation (oscillogram) for certain single sounds, e.g. some long aː.
# query all "aː" phonetic items
#
sl = query(db, "MAU == aː")
# instead of all 7,000 vowels take only 6
sl = sl[100:105, ]
The data for the oscillogram is stored in the SSF track dft, which is extracted by get_trackdata. The result is another R data frame.
# get "dft" track data for these segments
a_vowels = get_trackdata(emuDBhandle = db,
seglist = sl,
ssffTrackName = "MEDIAFILE_SAMPLES",
verbose = TRUE)
Warning in get_trackdata(emuDBhandle = db, seglist = sl, ssffTrackName = "MEDIAFILE_SAMPLES", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate:
session name annotates sample_rate
1 0000 AARBECHTSPLAZ1 AARBECHTSPLAZ1.wav 48000
2 0000 AARBECHTSPROZESS1 AARBECHTSPROZESS1.wav 48000
3 0000 AARBECHTSRECHT1 AARBECHTSRECHT1.wav 44100
md5_annot_json
1 a4df301bb5ca34b46a0b8438677334d9
2 6156d1a162658be190fa7568a00a6223
3 ef1565335d9bd950a71c7ff9c4d06540
INFO: parsing 6 wav segments/events
|
| | 0%
|
|============ | 17%
|
|======================= | 33%
|
|=================================== | 50%
|
|=============================================== | 67%
|
|========================================================== | 83%
|
|======================================================================| 100%
The oscillograms for these 6 vowel instances can then be visualised with ggplot2.
# plot oscillogram
ggplot(data = a_vowels) +
# define the df columns for the x and y values
aes(y = T1, x = times_rel) +
# line chart
geom_line() +
# sl_rowIdx groups all rows in a data frame belonging to the same segment
# labels contains the label of the segment
facet_wrap(~ sl_rowIdx + labels)
# query all words beginning with 'Fluch'
#
sl = query(db, query = "[ ORT =~ 'Fluch.*']")
Warning in query_labels(emuDBhandle, levelName = lvlName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
first character in a sequence i.e. 'a.*' now also matches
'weakness' as it contains the sequence. '^a.*'
matches sequences that start with 'a.*'
e.g. the word 'amongst'.
# get "f0" track data for these segments, in this case calculated on the fly
fluch_words = get_trackdata(emuDBhandle = db,
seglist = sl,
# using emuR's Michel Scheffers’ Modified Harmonic Sieve algorithm
onTheFlyFunctionName = "mhsF0",
verbose = TRUE)
Warning in get_trackdata(emuDBhandle = db, seglist = sl, onTheFlyFunctionName = "mhsF0", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate:
session name annotates sample_rate
1 0000 FLUCH1 FLUCH1.wav 44100
2 0000 FLUCH2 FLUCH2.wav 44100
3 0000 FLUCHBILLJEE1 FLUCHBILLJEE1.wav 44100
4 0000 FLUCHBLAT1 FLUCHBLAT1.wav 44100
5 0000 FLUCHFELD1 FLUCHFELD1.wav 44100
6 0000 FLUCHGESELLSCHAFT1 FLUCHGESELLSCHAFT1.wav 44100
7 0000 FLUCHHAFEN1 FLUCHHAFEN1.wav 44100
8 0000 FLUCHROUTE1 FLUCHROUTE1.wav 48000
9 0000 FLUCHSCHAIN1 FLUCHSCHAIN1.wav 48000
10 0000 FLUCHT1 FLUCHT1.wav 44100
11 0000 FLUCHTGEFOR1 FLUCHTGEFOR1.wav 44100
12 0000 FLUCHTICKET1 FLUCHTICKET1.wav 44100
md5_annot_json
1 4f96300f1fd30205575bd557c6464ed5
2 d4d6ab259206af2592edab568a1f3169
3 2c9d18baef54d1ed2dfe97af377dbb47
4 d21730527f22712aaa0e002341f3831c
5 3289c832ff6ff8b42d1803163d544cce
6 ef31e44d088bf2461cdae5284db0b1c5
7 44ca764332ca16eb00f72978113cd908
8 2d055afc598b0f4c44fa38afd7b295da
9 82e2fff18e70a74058e978932dea9e50
10 17ecb177ca527fc055c778f8708be1ae
11 836a096c2af4b9dd1acdfe6163e3570e
12 6896b5678da288e71af86bf414c67fb1
INFO: applying mhsF0 to 21 segments/events
|
| | 0%
|
|=== | 5%
|
|======= | 10%
|
|========== | 14%
|
|============= | 19%
|
|================= | 24%
|
|==================== | 29%
|
|======================= | 33%
|
|=========================== | 38%
|
|============================== | 43%
|
|================================= | 48%
|
|===================================== | 52%
|
|======================================== | 57%
|
|=========================================== | 62%
|
|=============================================== | 67%
|
|================================================== | 71%
|
|===================================================== | 76%
|
|========================================================= | 81%
|
|============================================================ | 86%
|
|=============================================================== | 90%
|
|=================================================================== | 95%
|
|======================================================================| 100%
In the corresponding graphs then the track for the fundamental frequency (f0) for some isolated words will be drawn.
# plot f0 tracks
ggplot(data = fluch_words) +
# define the df columns for the x and y values
# T1 here is the f0 value
aes(y = T1, x = times_rel) +
# line chart
geom_line() +
# sl_rowIdx groups all rows in a data frame belonging to the same segment
# labels contains the label of the segment
facet_wrap(~ sl_rowIdx + labels)
If needed, you can check the content of your segment list with the EMU WebApp running serve(seglist = sl).
The study of duration of speech segments is a standard task in acoustic phonetics. Let’s see how to solve this in R. Thanks to the label groups which are available in the database, we have easy access to e.g. all shortMonophthongs and all longMonophthongs.
# query all long and short vowels
longMonophthongs = query(db, "MAU == longMonophthongs")
shortMonophthongs = query(db, "MAU == shortMonophthongs")
# bind the two together to one segment list sl
sl = rbind(longMonophthongs, shortMonophthongs)
# print the count; number of rows
nrow(longMonophthongs)
[1] 22950
nrow(shortMonophthongs)
[1] 104115
The data frame contains per segment a start and end column, which allows for simple calculation of the segment duration.
# calculate durations
duration_long = longMonophthongs$end - longMonophthongs$start
duration_short = shortMonophthongs$end - shortMonophthongs$start
# calculate mean and standard deviation
paste("The mean duration for long monophthongs in the databse is", round(mean(duration_long)), "ms. The standard deviation is:", sd(duration_long), ".")
[1] "The mean duration for long monophthongs in the databse is 153 ms. The standard deviation is: 62.9194511796307 ."
The mean duration for long monophthongs in the databse is 153 ms. The standard deviation is: 62.9194511796307 .
paste("The mean duration for short monophthongs in the databse is", round(mean(duration_short)), "ms. The standard deviation is:", sd(duration_short), ".")
[1] "The mean duration for short monophthongs in the databse is 106 ms. The standard deviation is: 51.422737823314 ."
The mean duration for short monophthongs in the databse is 106 ms. The standard deviation is: 51.422737823314 .
Instead of having these broad categories long and short we can break them down to the individual vowels. The dataframes just created contain the individual vowels already in the labels column and can thus be easily utilised.
head(longMonophthongs)
First, we create an aggregated dataframe with all vowel segments. This aggregated dataframe now contains 127,065 rows (= long/short vowels).
# add new columns for length, either long or short
longMonophthongs$length <- "long"
shortMonophthongs$length <- "short"
# then merge the two data frame into a single one
all_vowels <- rbind(longMonophthongs, shortMonophthongs)
# add a new colum for segment duration
all_vowels$duration <- round(all_vowels$end - all_vowels$start)
# todo: keep propper sl throughout, will be needed later for requeries!
sl <- all_vowels %>% filter(!str_detect(bundle, "^[CDDEF].*"))
sl$length <- NULL
sl$duration <- NULL
Using the powerful library dplyr, we can now calculate the means for all vowels (= labels).
duration_vowels <- all_vowels %>%
group_by(labels, length) %>%
summarise(mean = mean(duration), sd = sd(duration))
`summarise()` has grouped output by 'labels'. You can override using the `.groups` argument.
duration_vowels
Finally, visualise the means in a bar chart
# basic setting for ggplot with data and aesthetics
ggplot(data = duration_vowels, aes(x= labels, y= mean, fill = length)) +
geom_bar(stat="identity", position=position_dodge()) +
# add errorbars
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
position=position_dodge(.9)) +
theme_minimal()
From here on the duration values can be utilised in further statistical analyses.
ggplot2Formant frequencies are the acoustic representation of the vowel’s quality. Of these, the first two formants are crucial for the vowels identity, i.e. F1 and F2. The values for the formants are already calculated in the database and will now be extracted for all the vowels our dataframe all_vowels.
As it takes long to do this, the dataframe has been precompiled and saved to file … and reloaded again.
# saveRDS(td_vowels, file ="td_vowels.RDS")
# td_vowels <- readRDS("td_vowels.RDS")
Due to the varying length of all vowels, it is necessary to normalize the length with normalize_length. This step takes long and, again, the dataframe has already been created and saved.
# normalise the length
# td_vowels_norm = normalize_length(td_vowels %>% select(-duration, -length), colNames = c("T1", "T2"))
# do some cleaning
# exclude words spoken by another (male) speaker
# td_vowels_norm <- td_vowels_norm %>%
# filter(!str_detect(bundle, "^[CDDEF].*"))
# save this dataframe
#saveRDS(td_vowels_norm, file ="td_vowels_norm.RDS")
# load this data frame
td_vowels_norm <- readRDS("td_vowels_norm.RDS")
Formants are calculated for every sampling point of the audio file. Thus, the dataframe td_vowels_norm now contains some 2,1 million rows/values.
For illustration purposes, the formants for the long vowels in three example words are selected into a dataframe (maachen, Viz, Kuuscht).
formant_examples <- td_vowels_norm %>%
filter(sl_rowIdx == "722" | sl_rowIdx == "21702" | sl_rowIdx == "11191")
These three vowels then are visualised as line charts.
ggplot(data=formant_examples) +
geom_smooth(aes(x = times_norm, y = T1, col = labels, group = labels)) +
geom_smooth(aes(x = times_norm, y = T2, col = labels, group = labels)) +
labs(x = "Duration (normalized)", y = "F1 + F2 (Hz)") +
ggtitle("Example formants F1, F2 the three corner vowels") +
facet_wrap(~ labels)
To get the identity of a vowel, it is common practice to select only one point in the middle of the vowel. To do so, I am using a filter and select the F1 and F2 values at the normalised time point 0.4. This gives us F1 and F2 values for 104,055 vowels.
vowel_midpoints = td_vowels_norm %>%
filter(times_norm == 0.4)
vowel_midpoints$labels <- as.factor(vowel_midpoints$labels)
Again, some further data conversion.
# convert Hertz to Bark
# convert T1, T2, T3 to Bark
td_vowels_norm$T1 <- emuR::bark(td_vowels_norm$T1)
td_vowels_norm$T2 <- emuR::bark(td_vowels_norm$T2)
# insert new columns from requeries to have the info about the next and next_next and previous label after the vowel available in the dataframe
vowel_midpoints$next_label = as.factor(requery_seq(db, seglist = sl, offset=1)$labels)
vowel_midpoints$next_next_label = as.factor(requery_seq(db, seglist = sl, offset=2, ignoreOutOfBounds =TRUE)$labels)
vowel_midpoints$word = requery_hier(db, seglist = sl, level="ORT")$labels
vowel_midpoints$previous_label <- as.factor(requery_seq(db, seglist = sl, offset= -1)$labels)
The dataframe vowel_midpoints is now enriched with additional columns for the actual word (of a vowel instance), previous_label, next_label and next_next_label, which allows to select data more fine-grained for the following visualisations, e.g. all vowels aː with a g in front and an R after it.
head(vowel_midpoints)
Prepared in such a way, the visualisations of this sheer amount of +100,000 vowels can commence. Vowel quality is charted in scatter plots, where the F1 value is plotted on the y-axis and the F2 value on the y-axis. To gain resemblance with the articulation of vowels, both axes are reversed. Thus i is located in the top front, u in the top back and a in the low position.
The result looks rather flashy, but nevertheless has some flaws. First, there are several outliers, especially in the SW corner. An elimination procedure for outliers is needed. Secondly, the single clouds for each vowels are overlapping each other to such an extent that the proper location is not identifiable.
# takes too long: do not run with full dataset!
ggplot(vowel_midpoints) +
aes(x = T2, y = T1, label = labels, col = labels) +
geom_text() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Hz)", y = "F1 (Hz)") +
ggtitle("Scatter plot of all vowels in the LOD database")
Next task: remove outliers. I am using a function based on the (Mahalnobis distance)[https://en.wikipedia.org/wiki/Mahalanobis_distance], which incrementally excludes outliers up to a predefined threshold. In the present case we keep 85 % of the values.
# outlier computations takes also a while
# to be on the safe side, also save this dataframe to disk
# saveRDS(without_outliers, file="without_outliers.RDS")
without_outliers <- readRDS("without_outliers.RDS")
without_outliers contains some 88,000 rows, meaning nearly 20,000 have been identified and removed as outliers.
To reduce the amount of data points in the scatter plot, one can draw contour plots (following this idea). This method geom_density2d facilitates to identify the core zones of vowel realisations, i.e. where - similar to a weather chart - the lines are closer together.
ggplot(without_outliers, aes(x = T2, y = T1, color=labels)) +
geom_density2d(aes(label= labels)) +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Contour plot of all vowels in the LOD database")
Warning: Ignoring unknown aesthetics: label
Instead of displaying each and every data point, draw only an ellipse for the extent of a vowel cloud. For each ellipse the middle point, the so-called centroid is also computed.
# plot centroids and ellipses
centroid = without_outliers %>%
group_by(labels) %>%
summarize(F1=mean(T1), F2=mean(T2))
ggplot(without_outliers) +
aes(x=T2,y=T1, col=labels,label=labels)+
stat_ellipse() +
geom_text(data=centroid,aes(x=F2,y=F1)) +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2 (Hz)", y = "F1 (Hz)") +
theme(legend.position="none") +
ggtitle("Dispersion of vowels in Luxembourgish")
A further possibility is to use the function facet_wrap which creates single plot for the factors of a variable. Here, individual plots are displayed according to the labels ( = vowels) in the dataframe. The elliptoid shape of each cloud is a result of the outlier removal procedure.
ggplot(without_outliers %>% filter(!(labels =="ɐ" | labels == 'ə' | labels == 'ɔː'))) +
aes(x = T2, y = T1, label = labels, col = labels) +
geom_text() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Scatter plot of all vowels in the LOD database") +
facet_wrap(~ labels)
From the visualisation shown so far it becomes obvious that vowels are not neatly separated but rather often show a certain amount of overlap. This is either due to measurement errors or is indicating that a vowel production is changing, i.e. that we are dealing with sound change. Sound change usually manifests itself in subconscious, minimal deviations from a former prototypical sound and these deviation may become more prominent at a later stage.
In Luxembourgish, too, several sound changes are in progress for the vowels. The last example in this tutorial covers the variational pattern of the two e-vowels, [e] (in words like Méck, bréngen, sécher) and [ə] (like in Schëff, Dësch). While for elder speakers these vowels are distinct, there is an ongoing merger for younger speaker by pronouncing the vowel before the certain consonants like [ə] instead of [e].
To tackle this task, we first need the relevant data, which has been prepared in a dataframe merger.
merger <- readRDS("e_merger_data.RDS")
With around 2,100 rows this dataset is not too large for a scatter plot. The relevant information is the horizontal overlap between the clouds. While two of the red ones ([k], [ŋ]) are quite distinct from the green one, it is the cloud for [ɕ] which is showing overlap with the cloud for [ʃ]. Here we have clear evidence of an ongoing vowel merger, i.e. the e-vowel realisations before [ɕ] and [ʃ] are becoming more similar.
ggplot(merger) +
aes(x = T2, y = T1, label = labels, col = labels) +
geom_text() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Scatter plot of the vowels [ə] and [e] in the LOD database") +
facet_wrap(~ next_label)
The actual amount of overlap between two cloud-like data sets can be calculated with the so-called Pillai score Details see here.
Basically, high Pillai scores indicate distinct clouds, low Pillai score indicate merger. (Source of figure)
Let’s now calculate the Pillai scores for the overlap. We extract the rows for [e] before [ɕ] and [ə] before [ʃ], respectively, and merge these to a temporary df ‘vowels’.
# select rows for the two sounds
e_ch_vowels <- merger %>%
filter((labels == "e" & next_label =="ɕ"))
e_sch_vowels <- merger %>%
filter((labels == "ə" & next_label =="ʃ"))
# new df with the selected rows
vowels <- rbind(e_ch_vowels, e_sch_vowels)
# calculate the Pillai score
pillai(cbind(T1, T2) ~ labels, data = vowels)
[1] 0.2756527
Remember, the Pillai score ranges between 0 and 1. The lower the score, the closer two groups of realisations. With this rather low value the Pillai score implies that the two vowel realisations are indeed rather close and merging into each other.